This is my attempt at entering the Kaggle Digit Recognizer competition:
https://www.kaggle.com/c/digit-recognizer
The Kaggle data comprises approx. 40,000 entries with pixel values from 0-255 across 785 columns (one label plus 784 pixel positions). There is an even distribution of entries across the 10 numbers/outcomes (0-9).
The data has already been processed in Python as follows:
all columns with constant values have been removed (< 200 columns)
the data was grouped by number/outcome and all columns whose average pixel value was <= 20 in every group were deleted. This cuts the total column count down to:
## [1] 434
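The two filtering steps above can be sketched in pandas. This is a toy illustration, not the original code; in particular it assumes the "<= 20" rule means a column is dropped when its mean pixel value is <= 20 within every digit group:

```python
import pandas as pd

def filter_columns(df, low=20):
    """Drop constant columns, then columns faint for every digit (assumed rule)."""
    pixels = df.drop(columns="label")
    # 1. drop columns that are constant across all rows
    pixels = pixels.loc[:, pixels.nunique() > 1]
    # 2. group by digit and drop columns whose per-digit mean never exceeds `low`
    means = pixels.groupby(df["label"]).mean()
    return pixels.loc[:, (means > low).any()]

# Toy frame: one constant column, one faint column, one genuinely bright column
toy = pd.DataFrame({
    "label":  [0, 0, 1, 1],
    "pixel0": [0, 0, 0, 0],      # constant -> dropped
    "pixel1": [5, 10, 5, 10],    # mean <= 20 for both digits -> dropped
    "pixel2": [0, 0, 200, 250],  # bright for digit 1 -> kept
})
kept = filter_columns(toy)
print(list(kept.columns))  # -> ['pixel2']
```

On the real training frame the same call would leave the 434 columns reported above.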
This should help me visualise whether any further columns should be removed, or whether I can create any composite features.
I then drew a few graphs to get a handle on the data.
There seem to be a few patterns here:
0 - more than 40 columns with avg_pixels in the 150-175 range
1 - the only number with avg_pixels over 225
6 - the only number with avg_pixels under 20?
Answer: Yes, they seem to be relatively well distributed for each number. Dead end here?
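The per-digit patterns above can be tallied rather than eyeballed. A small sketch, using a toy `means` frame (the real one would hold the avg_pixels of each column for each digit):

```python
import pandas as pd

def count_in_range(means, lo, hi):
    """Per digit, count columns whose avg_pixels falls inside [lo, hi]."""
    return ((means >= lo) & (means <= hi)).sum(axis=1)

# Toy per-digit means: rows are digits, columns are pixel positions
means = pd.DataFrame(
    {"pixel10": [160, 230, 5], "pixel20": [170, 240, 10]},
    index=[0, 1, 6],  # digit labels
)
counts = count_in_range(means, 150, 175)
print(counts)  # digit 0 lands both toy columns in the 150-175 band
```

Running this on the real means frame would confirm (or refute) claims like "0 has more than 40 columns in the 150-175 range".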
The box plots look interesting.
There is a much clearer distinction between the different numbers/outcomes above 100 pixels, even more so above 150.
Double-check whether this distinction applies across column names (pixel positions) as well?
Yes…
The distribution is less even across the numbers, particularly for column names under 200 and over 700. This could help the algorithm. The problem is that a lot of these columns have avg_pixels under 150, which I was considering removing…
I’ll move to the Machine Learning stage with five different datasets:
1. all columns with constant values and avg_pixels values < 20 removed - TOTAL COLUMNS: 434
2. as no. 1 but all columns with avg_pixels values < 100 removed - TOTAL COLUMNS: 271
3. as no. 1 but all columns with avg_pixels values < 150 removed - TOTAL COLUMNS: 191
4. as no. 2 but also including columns with names < 200 and > 700 - TOTAL COLUMNS: 331
5. as no. 3 but also including columns with names < 200 and > 700 - TOTAL COLUMNS: 263
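The five datasets can be derived from a single per-column statistic: the highest per-digit mean each column reaches. This is a hypothetical helper, not the original code; the threshold logic follows the list above, and the toy frame only checks the selection logic:

```python
import pandas as pd

def make_datasets(pixels, label):
    """Build the five candidate datasets from per-column peak brightness."""
    peak = pixels.groupby(label).mean().max()  # highest per-digit mean, per column
    idx = pixels.columns.str.replace("pixel", "").astype(int)
    edge = (idx < 200) | (idx > 700)           # column "names" under 200 / over 700

    d1 = pixels.loc[:, peak > 20]              # 434 columns on the real data
    d2 = pixels.loc[:, peak > 100]             # 271
    d3 = pixels.loc[:, peak > 150]             # 191
    d4 = pixels.loc[:, (peak > 100) | ((peak > 20) & edge)]  # 331
    d5 = pixels.loc[:, (peak > 150) | ((peak > 20) & edge)]  # 263
    return d1, d2, d3, d4, d5

# Toy check: four pixel columns, two digits
pixels = pd.DataFrame({
    "pixel50":  [250, 10],  # bright, name < 200
    "pixel300": [120, 0],   # mid-range, central name
    "pixel400": [30, 0],    # faint, central name
    "pixel750": [60, 0],    # faint, name > 700
})
label = pd.Series([0, 1])
sizes = [d.shape[1] for d in make_datasets(pixels, label)]
print(sizes)  # -> [4, 2, 1, 3, 2]
```

Datasets 4 and 5 keep a faint edge column (pixel750) that datasets 2 and 3 would discard, mirroring the "including columns with names < 200 and > 700" rule.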